

June 9, 1986 Vol. 144 THE MEDICAL JOURNAL OF AUSTRALIA


Statistical significance and confidence intervals 



Many papers in the Journal use
statistical methods and one of the
aims of the review process is to try
to ensure that appropriate methods have 
been used. Often papers report results of 
comparative studies that are designed to
answer questions such as whether one 
treatment is superior to another for a 
particular disease, or whether there is an 
association between some form of behaviour 
(for example, taking regular exercise or 
smoking) and the occurrence of some 
disease. Comparative studies are almost 
invariably carried out on a sample of 
individuals who are chosen from the 
population of individuals to whom it is 
intended to generalize the results. Data are 
collected on the sample in order to make
inferences on the population. Valid
inferences can only be drawn if the sample
is chosen in such a way that it is representative
of the population. Otherwise a bias
could occur; epidemiological methods are 
designed to eliminate such biases. 

Since the aim of a statistical analysis is to 
make inferences, it is paramount to express
whatever inferences can be drawn in the
most informative way. There are several 
methods of statistical inference, but the two 
that are most commonly used are 
significance testing and confidence interval 
estimation. The former is well known and 
is featured by quoting P values. Many 
authors appear to be under the impression 
that a profusion of P values is necessary; 
regrettably this impression has been bolstered 
in the past by editors of biological journals. 
Significance testing has its place but, as
mentioned by Healy in 1978,¹ "it is widely
agreed among statisticians (if less so among 
the more naive users of statistics) that 
significance testing is not the be-all and end- 
all of the subject". In this leading article I 
would like to discuss the characteristics of 
both methods of inference, show that a 
confidence interval contains the result of a 
significance test, but not vice versa, and 
suggest that confidence intervals are the 
answers to the more interesting questions 
that data can be used to answer. 

Any particular study is based on a 
particular sample; however, it is useful to 
imagine that the study is repeated with a 
different sample being selected each time. 
These hypothetical studies will give different 
results because they contain different 
individuals, and individuals vary in any 
characteristic because of biological variability.
The differences are termed sampling
variability. It follows then that the results 
that are obtained from a particular sample 
can only be taken as an approximation to the 
actual situation in the whole population. 
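Statistical methods are concerned with
assessing the degree of approximation and
what may be reasonably inferred, given that a
different sample would have produced a
different result.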


The methods are based on the assumption
that it is a matter of chance which particular 
subjects are in the sample that is being 
studied, and the sampling variability is thus 
random variation which is determined by the 
laws of probability. Therefore, the inferences 
are expressed in terms of probability. The 
situation is illustrated below. 

Population
    |  (sampling variation)
    v
Sample data
    |  (uncertainty)
    v
Inferences on population

Taking a sample from the population
involves sampling variation. As a consequence
of this, inferences from the sample
data back to the population involve
uncertainty.
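
The idea of repeating a study with a different sample each time can be illustrated with a small simulation. The following is only a sketch: the population of blood pressures, its mean and spread, the sample size and the number of repeated studies are all invented for illustration and do not come from any study discussed here.

```python
import random
import statistics

def sample_means(population, sample_size, n_studies, seed=0):
    """Mean of each of n_studies random samples drawn from one population."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, sample_size))
            for _ in range(n_studies)]

# Hypothetical population: 10,000 systolic blood pressures (mmHg), invented.
_rng = random.Random(1)
population = [_rng.gauss(130, 15) for _ in range(10_000)]

# 200 hypothetical studies, each on a different random sample of 50 people.
means = sample_means(population, sample_size=50, n_studies=200)

print(f"population mean: {statistics.mean(population):.1f}")
print(f"sample means range from {min(means):.1f} to {max(means):.1f}")
print(f"SD of the sample means: {statistics.stdev(means):.2f}")
```

Each simulated study gives a somewhat different mean although the population never changes; the spread of those means is the sampling variability that the statistical methods have to allow for.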

A statistical analysis may be thought of as 
asking questions of the data. In an investigation
that compares two groups for the
mean value of, for example, blood pressure
or the prevalence of some disease, three
questions may be posed: Is there a difference
between the groups? How large is the
difference? And how accurately is the size
of the difference known?

As expressed, the first question expects the
answer "yes" or "no"; although the answer
cannot be given in precisely these terms, it
is often reduced to two possibilities. The
appropriate methodology is the significance
test. The second question expects a numerical
value to be the answer. This is an estimate 
and, as it is a single value, is referred to as 
a point estimate. In effect, the third question 
asks how reliable this point estimate is; the 
answer is a range of values which is referred 
to as an interval estimate or a confidence
interval.

These questions represent two approaches 
to inference: hypothesis testing and
estimation. Although at first sight they 
appear to be quite different, in concept they 
have much in common. Both make 
inferential statements about the value of a 
parameter. (A parameter is an unknown 
quantity which partly or wholly characterizes 
a population, for example, a mean or a 
measure of association.) 

The significance test is an appropriate
technique when there is an a priori hypothesis
to test. For the purpose of the statistical test
this hypothesis is expressed in null form,
such as when no difference exists between
groups, and the test evaluates whether the
data are consistent with it. If the data differ
markedly from those which would be expected
under the null hypothesis,
to the extent that the probability of such an
extreme result is low, then it is said that the 
result is statistically significant. Probability 
is measured on a continuum between 0 and 
1, but in significance testing a probability is 
considered low if it is less than conventional 
values such as 0.05 (5%) or 0.01 (1%). A
significant result is equated with the rejection 
of the null hypothesis or the claim of a real 
effect. By definition, when the null 
hypothesis is true, significant results will 
occur by chance with the same relative 
frequency as the significance probability. 
That is, real effects will be claimed when the 
null hypothesis is true; however, the probability
of this error (type I) is determined in
the data analysis.
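
The statement that significant results occur by chance at the same rate as the significance level can be checked by simulation. This is a minimal sketch under assumptions of my own choosing: two groups of 30 drawn from the same normal distribution with a known standard deviation of 1, compared with a two-sided z test. The function name and all settings are hypothetical.

```python
import random
from statistics import NormalDist

def false_positive_rate(alpha=0.05, n_per_group=30, n_studies=4000, seed=0):
    """Share of simulated studies declared significant when the null is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n_per_group) ** 0.5  # SE of the difference in means, SD known = 1
    significant = 0
    for _ in range(n_studies):
        # Both groups come from the same population, so any "effect" is chance.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        if abs(diff) / se > z_crit:
            significant += 1
    return significant / n_studies

rate_5 = false_positive_rate(alpha=0.05)
rate_1 = false_positive_rate(alpha=0.01)
print(f"significant at the 5% level: {rate_5:.3f} of null studies")
print(f"significant at the 1% level: {rate_1:.3f} of null studies")
```

The observed proportions should lie close to 0.05 and 0.01, which is the sense in which the analyst fixes the type I error probability by choosing the significance level.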

One disadvantage of a significance test is 
that it may fail to detect a real effect; that
is, although the null hypothesis is false, the
evidence is not strong enough to reject it. The
probability of this error (type II) can be
controlled at the design stage only, by
appropriate selection of the sample size, and
may be quite large. Thus, the trap of
equating non-significance with no effect 
must be avoided; failure to reject the null 
hypothesis is not the same as accepting it. 
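
How the type II error depends on sample size can be sketched with the usual normal approximation for comparing two proportions. The success rates (50% and 35%) and the group sizes below are hypothetical, and the formula is a textbook approximation rather than anything prescribed in this article.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    Normal approximation with equal group sizes; the negligible chance of
    a significant result in the wrong direction is ignored.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error assuming no difference (used by the test itself).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    # Standard error when the assumed real difference holds.
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    return norm.cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)

for n in (20, 50, 100, 200, 300):
    power = approx_power(0.50, 0.35, n)
    print(f"n per group = {n:3d}: power = {power:.2f}, "
          f"type II error = {1 - power:.2f}")
```

With small groups the type II error is large even though the assumed difference is real, so a non-significant result from such a study says little about the absence of an effect.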

In the approach of confidence interval
estimation no particular hypothesis is considered;
rather, the emphasis is on estimating
those values of the parameter with which the
data are consistent. These values form a
range: the confidence interval. The range
is calculated so that there is a high probability
(conventionally 95% or 99%) that
it contains the true value of the parameter.

A significance test is essentially a test of
whether the data are consistent with a
specified parameter value, and the confidence
interval contains those parameter
values with which the data are consistent.
Therefore, a 5% significance test and a 95%
confidence interval contain some information
in common: significance implies that
the null hypothesis value is outside the confidence
interval; non-significance implies that
the null hypothesis value is within the confidence
interval. However, the confidence
interval contains more information because
it is equivalent to performing a significance
test for all values of the parameter, not just
a single value. A confidence interval enables
a reader to see how large the effect may be,
not simply whether it is different from zero.
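
The correspondence between a 5% test and a 95% interval can be shown directly when both are based on the same estimate and standard error. The sketch below assumes a normally distributed estimate; the estimate (4.0), its standard error (1.5) and the grid of candidate parameter values are hypothetical.

```python
from statistics import NormalDist

def p_value(estimate, se, null_value=0.0):
    """Two-sided P value for the hypothesis that the parameter equals null_value."""
    z = (estimate - null_value) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def confidence_interval(estimate, se, level=0.95):
    """Normal-theory confidence interval for the parameter."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z_crit * se, estimate + z_crit * se

estimate, se = 4.0, 1.5  # hypothetical difference and its standard error
low, high = confidence_interval(estimate, se)
print(f"95% confidence interval: {low:.2f} to {high:.2f}")

# Test every candidate value on a grid: the values that are NOT rejected
# at the 5% level are exactly those lying inside the 95% interval.
for tenths in range(-20, 101, 5):
    value = tenths / 10
    rejected = p_value(estimate, se, value) < 0.05
    inside = low <= value <= high
    assert rejected != inside
```

The single test against zero answers only one of these questions; the interval answers all of them at once.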

The limitations of the interpretations that 
are provided by a significance test may now 
be considered.

The difference is significant. This means 
that there is a difference or, in other words, 
the size of the difference is not zero. We 
know no more than this. The difference may
be large and of great importance or it may
be small and of no practical importance. It
is unsatisfactory that the test provides no way 
of distinguishing between these quite 
different possibilities. 

The difference is not significant. This
means that there is insufficient evidence to
enable us to conclude that there is a
difference. So the difference may well be
zero. But this is not the same as saying that
it is zero. The true difference may be quite
large. Again, it is unsatisfactory that this
possibility is not addressed. 

The conclusions that may be drawn from 
a significance test are considered to be 
incomplete because it is rarely that one is 
interested solely in whether a null hypothesis 
is or is not true; indeed in many cases it may 
be recognized at the outset that the null 
hypothesis is unlikely to be true. Rather, the 
question is how large is the difference and 
is it possibly large enough to be important? 
The emphasis is on measuring rather than on 
testing. The addition of the concept of an 
important difference to that of a null 
hypothesis means that there are four possible 
interpretations to an analysis: (a) the 
difference is significant and large enough to
be of practical importance; (b) the difference 
is significant but too small to be of practical 
importance; (c) the difference is not 
significant but may be large enough to be 
important; and (d) the difference is not 
significant and also not large enough to be 
of practical importance. 
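
One way of turning these four interpretations into a rule is sketched below. It rests on two assumptions that are mine rather than the article's: "significant" is read as "the confidence interval excludes zero", and "large enough to be important" is read as "the interval reaches the chosen important difference". A stricter reading of (a) would require the whole interval to lie beyond that value. The function name and the example intervals are hypothetical.

```python
def interpret(ci_low, ci_high, important):
    """Return 'a', 'b', 'c' or 'd' for a confidence interval of a difference.

    important is the smallest difference (in absolute size) regarded as
    being of practical importance; choosing it is a medical judgement.
    """
    if ci_low > ci_high or important <= 0:
        raise ValueError("need ci_low <= ci_high and important > 0")
    significant = ci_low > 0 or ci_high < 0  # interval excludes zero
    may_be_important = max(abs(ci_low), abs(ci_high)) >= important
    if significant:
        return "a" if may_be_important else "b"
    return "c" if may_be_important else "d"

# Hypothetical intervals, with 5 units taken as the important difference.
for low, high in [(6.0, 14.0), (1.0, 4.0), (-3.0, 12.0), (-2.0, 3.0)]:
    print(f"{low:5.1f} to {high:5.1f}: interpretation ({interpret(low, high, 5.0)})")
```

Because the rule takes the important difference as an argument, the same interval can be reinterpreted by readers who prefer a different value.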


The size of difference that is considered 
to be large enough to be important is a 
matter for debate, and genuine differences 
of opinion may arise. It is a medical, not a 
statistical, question, although a medical 
statistician who is experienced in the subject 
area could contribute to setting a value. The 
fact that agreement on a unique value may
be impossible in no way detracts from the 
argument. In fact, expressing the results as 
a confidence interval enables interpretations 
to be made for any particular value that is 
considered appropriate. 

These possibilities are illustrated in the
Figure where the confidence intervals are
shown. The significant and non-significant
cases are distinguished by the confidence
intervals that exclude or include zero respectively.
The main point is that in each case
the confidence interval gives the range of
possible values for the true difference. Of
particular concern is (c). Here there may be
no true difference or there may be a large,
important difference. In other words the
study is completely inconclusive. Such a
possibility is missed by the simple expression
"not significant" with its lure of equating
this falsely with "no effect". This situation
will arise with a study that is carried out on
too small a sample and this is why good study
design demands attention to sample size to
try to prevent the occurrence of an inconclusive
result. Altman found that it was
common for undue emphasis to be placed on
"negative" findings from small studies,²
while Freiman et al. noted that "negative"
trials were often too small to constitute a fair
test of therapies.³ Similarly, a significance
test will contrast (b) as significant and (d) as
not significant but fails to recognize that they
give essentially the same conclusion: that
any difference is too small to be important.

FIGURE: Confidence intervals showing four possible conclusions in terms of statistical significance
and practical importance. Labels in the figure: Important; Not important; Inconclusive; True negative
result.

As an example, consider some results
which were obtained by Garraway et al. from
a clinical trial for the management of acute
stroke in the elderly.⁴ Of 155 patients who
were managed in a stroke unit, 78 were
assessed as independent when they were
discharged from the unit compared with 49
of 152 who were managed in a medical unit.
The simplest analysis shows that the
difference between the success rates of the
two units is significant at the 1% level.
Therefore, a genuine effect has been established.
To appreciate the importance of this
effect the advantage of the stroke unit may
be measured by the difference between the
two units in the percentage of subjects
who were discharged as independent:
50.3% - 32.2% = 18.1%. This is the point
estimate. The accuracy of this estimate is
given by its standard error (5.5%) and the 95%
confidence limits (7.3% and 28.9%). Thus,
the gain could be as large as 29% or as small
as 7%.
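
The figures in this example can be reproduced with a few lines of arithmetic. The sketch takes the stroke unit count as 78 of 155, the count consistent with the quoted 50.3%, and uses the usual large-sample standard error for a difference between two proportions. The article does not say which test was "the simplest analysis"; the pooled z test below is an assumption, as are the function names.

```python
from math import sqrt
from statistics import NormalDist

def difference_in_proportions(x1, n1, x2, n2, level=0.95):
    """Point estimate, standard error and confidence interval for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, se, (diff - z_crit * se, diff + z_crit * se)

def pooled_z_test(x1, n1, x2, n2):
    """Two-sided P value for no difference, using the pooled proportion."""
    pooled = (x1 + x2) / (n1 + n2)
    se0 = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se0
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Stroke unit: 78 of 155 independent; medical unit: 49 of 152 independent.
diff, se, (low, high) = difference_in_proportions(78, 155, 49, 152)
print(f"difference = {100 * diff:.1f}%, SE = {100 * se:.1f}%, "
      f"95% limits = {100 * low:.1f}% and {100 * high:.1f}%")
# prints: difference = 18.1%, SE = 5.5%, 95% limits = 7.3% and 28.9%

p = pooled_z_test(78, 155, 49, 152)
print("significant at the 1% level" if p < 0.01 else "not significant at the 1% level")
```

The test result and the interval agree, but only the interval shows that the benefit might be anywhere from about 7 to about 29 percentage points.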

Recently, Gardner and Altman have 
argued against the excessive use of hypothesis
testing and urged a greater use of confidence
intervals.⁵ In an appendix to their paper they
give methods to calculate confidence 
intervals for the commonly occurring two- 
sample comparisons. 

In presenting the main results of a study 
it is good practice to provide confidence 
intervals rather than to restrict the analysis 
to significance tests. Only by so doing can
authors give readers sufficient information 
for a proper conclusion to be drawn; 
otherwise readers have to rely upon the 
authors' own interpretation. Therefore,
intending authors are urged to express their 
main conclusions in confidence interval form 
(possibly with the addition of a significance 
test, although strictly that would provide no 
extra information). One of the aims of the
Journal's statistical review process will be to
ensure that where possible this is done. 

GEOFFREY BERRY 

Associate Professor of Biostatistics
School of Public Health and Tropical Medicine
The University of Sydney

1. Healy MJR. Is statistics a science? J R Statist Soc A
1978; 141: 385-393.

2. Altman DG. Statistics in medical journals. Stat Med
1982; 1: 59-71.

3. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR.
The importance of beta, the type II error and sample
size in the design and interpretation of the randomised
control trial. N Engl J Med 1978; 299: 690-694.

4. Garraway WM, Akhtar AJ, Prescott RJ, Hockey L.
Management of acute stroke in the elderly: preliminary
results of a controlled trial. Br Med J 1980; 280:
1040-1043.

5. Gardner MJ, Altman DG. Confidence intervals rather
than P values: estimation rather than hypothesis
testing. Br Med J 1986; 292: 746-750.


Source: https://www.industrydocuments.ucsf.edu/docs/mnbjOOOO 





